Structured Query Reformulations in Commerce Search 



Sreenivas Gollapudi Samuel Ieong Anitha Kannan 

Microsoft Research Silicon Valley 
{sreenig,saieong,ankannan}@microsoft.com 



ABSTRACT 

Recent work in commerce search has shown that understanding the 
semantics in user queries enables more effective query analysis and 
retrieval of relevant products. However, due to lack of sufficient 
domain knowledge, user queries often include terms that cannot be 
mapped directly to any product attribute. For example, a user look- 
ing for designer handbags might start with such a query because 
she is not familiar with the manufacturers, the price ranges, and/or 
the material that gives a handbag designer appeal. Current com- 
merce search engines treat terms such as designer as keywords 
and attempt to match them to contents such as product reviews and 
product descriptions, often resulting in poor user experience. 

In this study, we propose to address this problem by reformu- 
lating queries involving terms such as designer, which we call 
modifiers, to queries that specify precise product attributes. We 
learn to rewrite the modifiers to attribute values by analyzing user 
behavior and leveraging structured data sources such as the prod- 
uct catalog that serves the queries. We first produce a probabilistic 
mapping between the modifiers and attribute values based on user 
behavioral data. These initial associations are then used to retrieve 
products from the catalog, over which we infer sets of attribute val- 
ues that best describe the semantics of the modifiers. We evaluate 
the effectiveness of our approach based on a comprehensive Me- 
chanical Turk study. We find that users agree with the attribute 
values selected by our approach in about 95% of the cases and they 
prefer the results surfaced for our reformulated queries to those for
the original queries 87% of the time.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval - Query formulation; H.2.8 [Database Management]:
Database Applications - Data mining

General Terms 

Algorithms, Experimentation 

1. INTRODUCTION 

There has been tremendous growth in the amount of commerce
conducted over the web in the past decade. In a recent survey,
comScore reported a record-breaking $44.3 billion in e-commerce
retail spending in the U.S. in the first quarter of 2012, up 17% from
the previous year [1]. Nearly 70% of all Internet users have made at
least one online purchase during this time.

Whether through dedicated e-commerce sites such as Amazon or
search engine verticals such as Google or Bing shopping, most online
transactions begin with search. However, there are important
differences between web search and commerce search. While web
search is typically performed over unstructured data such as the
contents of webpages, commerce search is typically performed over
structured data in the form of a product catalog. The catalog
provides rich semantics that can be associated with both queries and
products, and can lead to more effective query analysis and ranking
[9, 20]. For example, techniques exist to annotate keyword queries
with type semantics, such as annotating [nikon digital cameras] as
[brand:nikon category:digital cameras] [20]. These queries can then
be used to search structured data sources and enable search engines
to find more relevant products.

However, not all commerce queries can be annotated with such 
clean semantics. Due to possible lack of domain knowledge, users 
often express their information needs with query terms that cannot 
be directly annotated using the structured data in the product cat- 
alog. For example, consider the query [designer handbags]. A 
user may issue this query to discover aspects such as brands and 
materials that constitute "designer" appeal to handbags, and expect 
the search engine to retrieve products that capture these nuances. 
However, there is no explicit type semantics that can be associated
with terms such as [designer] based on the catalog, leaving such
terms as free tokens (as opposed to typed tokens such as
[brand:nikon]) to be handled by the retrieval system.

Current commerce search engines approach this challenge by
matching the free tokens as keywords. For this to work, the search
engine must decide on the sources of information over which the
matching is done. Typical sources include product descriptions and
user reviews. There are several drawbacks to this solution. First,
these sources could be noisy: a seller may have incentives to label
all of the handbags with positive terms such as "designer" and
"stylish" to boost sales. Second, the information could be dated: a
handbag that was considered designer a year ago may be blasé
today. Finally, relying on textual matches could adversely affect
recall in cases where the free token is rare.

Figure 1 shows an illustrative example that demonstrates the
limitation of matching free tokens as keywords. In this figure, we
show the results for the query [designer handbags] on Amazon,
with search restricted to the category Women's Handbags & Purses
(results without restriction appear worse). In our opinion, the
retrieved products are poor reflections of the characteristics of
"designer" handbags. These products are likely retrieved due to the
term "designer" in their titles. We believe better results can be
retrieved through query reformulation, for example, by reformulating
the query [designer handbags] to [gucci leather handbags].

Figure 1: Results for the query [designer handbags] on Amazon

There are many ways in which free tokens can change the intents
of the queries, and thus the set of results to be retrieved. Our
work focuses on free tokens that change the set of products to be
retrieved as opposed to ones that change the nature of the products
to be retrieved. We call the former modifiers. For example, in the
category of televisions, the free token portable indicates that the
users are likely looking for televisions that are light in weight and
small in size, whereas the free token manual indicates that the users
are looking for operation instructions for certain televisions and not
televisions themselves.

In this study, we propose a principled approach to reformulate 
queries containing modifiers to ones that specify precise attributes. 
Our intuition is that users who issue queries such as [designer
handbags] will end up spending more time browsing designer
handbags in their search session, and can thus inform us of the
attributes that give handbags "designer" appeal. We first learn to
identify the set of modifiers among all free tokens. We then learn 
the association of different attribute values with modifiers by ana- 
lyzing the user browsing behavior in search sessions. These asso- 
ciations are then used to retrieve products from the product cata- 
log, over which we infer sets of attribute values that best describe 
the modifiers. We conduct a comprehensive study to evaluate our 
approach, and find that in 95% of the cases users agree with our 
selected attribute values for the modifiers and in 87% of the cases 
prefer the results based on our reformulated queries to the original 
ones. 

The remainder of the paper is organized as follows. We first
review past work on commerce search and query reformulation in
Section 2. We then explain our approach of combining user browse
signals with structured data from product catalogs to produce
candidate reformulations in Section 3. We present a comprehensive
user study conducted over the Mechanical Turk platform in
Section 4. We conclude with our main findings and suggest future
research directions in Section 5.

2. RELATED WORK 

Our work is motivated by the problem of answering commerce
queries against a product catalog used in a commercial shopping
search engine such as those used by Amazon and Bing. As mentioned
in the Introduction, a commerce query can be thought of as being
composed of semantic tokens that can be annotated with
corresponding attributes, which we call typed tokens, and free tokens.
As an example, for the query [prada designer handbags], the
typed tokens are [brand:prada category:handbags] while the
free token is [designer]. Much of the work in the literature has
focused on inferring the attributes to be associated with typed tokens.
In [15], a conditional random field is trained to infer attributes
corresponding to query tokens, while in [20], a probabilistic generative
model is trained to infer the most likely complete annotation for the
query. In both these methods, tokens that cannot be mapped to an
attribute are left as free tokens. In this paper, we assume access to
a semantic parser such as [20] to understand the semantics of the
tokens, and focus on the problem of learning to reformulate queries
that contain modifiers into ones that specify precise attribute values
that can well satisfy the users.

When semantics is inferred for a subset of tokens, search is
performed by a combination of exact or approximate match of semantic
tokens against the attribute values in the index and keyword search
against the textual description of the product (using the free
tokens). As textual descriptions tend to be sparse or inaccurate,
keyword search becomes ineffective. Thus, the problem has attracted
attention from both the IR and the database communities.
There have been a number of studies on answering keyword queries
for both traditional and XML databases [4, 10, 11, 12, 13, 14, 16,
17, 18, 22]. A recent survey is given in [23]. They generally
focus on three questions: how to efficiently retrieve tuples that
contain the keywords from the database, how to find relations that
can be joined to produce such tuples, and how to support efficient
query processing that involves retrieval functions such as BM25.
An essential assumption is that the keywords being searched
appear in the database somewhere. Our work addresses a different
problem, where the keywords we are interested in may not be
explicitly mentioned in the database, and a mapping from keywords
to attribute-value pairs has to be learned through user queries and
their browse trails. Our work can complement, for example, the
review-driven approach in [7] by providing an additional source of
signals from user behavior.

Related work in web search is primarily on query reformulations
[15, 8] and document expansion [6, 3]. Our work differs primarily
in that we study the mapping of free tokens in the query to
structured attribute values describing the products in the catalog.

3. MODEL AND FORMULATION 
3.1 Overview 

Our work is done in the context of a commerce search engine that
answers commerce queries given a product catalog. We assume the
existence of a semantic parser that identifies the attributes in the
query and extracts their associated values, based on past work such
as [20]. We call the processed queries annotated queries. For
example, the query [gucci leather handbags] will be annotated
as [category:handbags brand:gucci material:leather].
However, not all tokens in the query can be matched to an attribute.
In particular, users may use terms such as designer that do not
match any attribute value in the product catalog. Such tokens will
be marked as free by the semantic parser. The goal of our work
is to understand how to identify the free tokens that change the
sets of products to be retrieved, called modifiers, and to learn to
reformulate queries that contain modifiers into ones that specify
precise attribute values that can well satisfy the users, such as from
[category:handbags free:designer] to [category:handbags
brand:gucci material:leather].
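
As a rough illustration of the annotated-query idea, the following sketch represents a parsed query as a set of typed AV pairs plus a set of free tokens, and swaps a free token for attribute values. The class name `AnnotatedQuery`, the function `reformulate`, and the rewrite table are hypothetical; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedQuery:
    """Hypothetical container for a parsed commerce query."""
    typed: frozenset  # AV pairs such as ("brand", "gucci")
    free: frozenset   # tokens the parser could not map to any attribute


def reformulate(query, rewrites):
    """Replace each free token that has a known rewrite with its AV pairs."""
    typed = set(query.typed)
    remaining = set()
    for token in query.free:
        if token in rewrites:
            typed.update(rewrites[token])
        else:
            remaining.add(token)  # unknown free tokens are left untouched
    return AnnotatedQuery(frozenset(typed), frozenset(remaining))


# [category:handbags free:designer]
original = AnnotatedQuery(
    typed=frozenset({("category", "handbags")}),
    free=frozenset({"designer"}),
)
# Illustrative rewrite table; in the paper this mapping is learned.
rewrites = {"designer": {("brand", "gucci"), ("material", "leather")}}
rewritten = reformulate(original, rewrites)
```

Here `rewritten` corresponds to [category:handbags brand:gucci material:leather], with no free tokens left.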

At a high level, our solution is to analyze the user browse
sessions to discover common features of the products that can satisfy
the modifiers. It consists of four steps. First, we analyze the
sessions to generate labels for domains. Second, we identify the set of
valid modifiers among all tokens that we cannot map to any product
attributes. Third, we estimate the likelihood of each modifier
being associated with particular attribute values. Finally, we retrieve
products from a database based on the identified attribute values
and generate query rewrites that are good representations of the
retrieved products. We will go into each of these steps in further
detail in this section. First, we introduce notation that will be
used throughout the section.

Let $P$ denote a database of products. Let $A$ denote the set of
attributes of the products, with $|A| = k$. For each attribute
$a \in A$, let $V_a$ denote the set of valid values for attribute $a$.
We denote the set of all valid attribute-value pairs, or AV pairs for
short, by $AV$, defined by the set $\{(a, v) : a \in A, v \in V_a\}$.
Each product $p \in P$ is represented as a set of AV pairs,
$\{(a_1, v_1), (a_2, v_2), \ldots, (a_k, v_k)\}$, one for each
attribute. For a set of AV pairs $S$, let $P(S)$ denote the subset of
products in $P$ that match all of the specified attribute values. For
example, if $S = \{(\text{brand}, \text{sony}), (\text{diagonal size}, 32)\}$,
then $P(S)$ represents all products with brand equal to sony and
diagonal size equal to 32.
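
The retrieval operator P(S) can be sketched as a subset test over products stored as sets of AV pairs. The function name `matching_products` and the toy catalog are assumptions made for illustration only.

```python
def matching_products(products, av_pairs):
    """Return ids of products whose AV pairs include every pair in av_pairs.

    This mirrors P(S): a product matches when S is a subset of its AV pairs.
    """
    required = set(av_pairs)
    return {pid for pid, pairs in products.items() if required <= pairs}


# Toy catalog: product id -> set of (attribute, value) pairs.
catalog = {
    "tv1": {("brand", "sony"), ("diagonal size", 32)},
    "tv2": {("brand", "sony"), ("diagonal size", 52)},
    "tv3": {("brand", "haier"), ("diagonal size", 32)},
}
S = {("brand", "sony"), ("diagonal size", 32)}
result = matching_products(catalog, S)
```

With this catalog, only `tv1` satisfies both constraints, and an empty S matches every product.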

Let $U$ denote the collection of user browse sessions. We call
these sessions browse trails, where each trail is associated with a
query and a sequence of websites visited by the user. As we will
be analyzing these trails at the domain level, let $D$ denote the set
of domains. We denote the collection of all tokens in the queries
by $T$. These tokens include both typed tokens expressed as AV
pairs and free tokens that the parser cannot map to any attribute
value, denoted by $F$. We consider some of these free tokens to
be modifiers, i.e., tokens that restrict the subset of products to be
retrieved. We denote the set of modifiers by $M$. Stated formally,
the goal of our work is to find an algorithm that associates each
modifier $m \in M$ with a set of AV pairs $S \subseteq AV$. Success
will be measured empirically via human judgements, as described
further in Section 4.

3.2 Labeling Domains 

In order to discover the product features a user is interested in
given a query containing a modifier, we start by discovering the
products the user examined after her query. As search engines may
fail to understand the query and not surface the right results,
we consider not only the page that a user clicked on but also the
subsequent pages the user visited in the session. Such collections
of interactions have been used in previous studies and are called
browse trails [5, 19, 21]. Using browse trails is especially
important for understanding queries that contain modifiers, as our
experience suggests that search engines are especially poor at
answering such queries.

Ideally, we would like to find out the exact product(s) on each
of the webpages on the browse trail; unfortunately, due to the large
volume of data that needs to be processed, it is computationally
infeasible to parse each of the webpages. Instead, we label the
websites by propagating the query tokens, including both AV pairs
and free tokens, along the browse trails. Further, to overcome
data sparsity, we group websites by their host domains and
generate labels at the domain level. Intuitively, this is a reasonable
approach because users looking for [sony tv] will likely spend more
time on http://www.sony.com than, say, http://www.frys.com,
whereas users looking for [52 inch led tv] will spend more
time on general merchant sites such as http://www.bestbuy.com
and http://www.nextag.com. Likewise, users looking for
[widescreen tv] will more often end up on merchants that have
widescreen televisions in their catalog. A similar approach has
been previously proposed in [19] and has been shown to produce
good labels for domains.

Stated formally, the goal of our first step is to take as input the
collection of browse trails $U$ and output the frequency counts
$c(t, d)$ of how often queries containing token $t \in T$ end up
visiting domain $d \in D$. The frequency counts are computed using a
variant of the heavy hitter algorithm. In short, for each trail $u$, we
create a set of token-domain pairs by pairing up each domain in $u$
with each of the tokens in the query that $u$ originated from. We then
run a sampling-based algorithm to determine approximate counts.
Through careful accounting, the resulting frequency counts can be
shown to be close to the true counts [19]. Given these counts, together
with suitable normalization, we can compute, for example, the
distribution of free tokens for a given domain $d$, $P(f \mid d)$ for
$f \in F$.

Table 1: Association between domains and AV pairs, grouped by
attributes, in the television category

Table 2: Association between domains and the free token portable in
the television category
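
The counting step can be sketched as follows. Note the swap: the paper uses a sampling-based variant of the heavy hitter algorithm to obtain approximate counts at scale, whereas this sketch counts exactly, which is only practical for small inputs. The function names and the toy trails are hypothetical.

```python
from collections import Counter, defaultdict


def count_token_domain_pairs(trails):
    """Exact c(t, d): number of trails pairing token t with domain d.

    trails: iterable of (query_tokens, visited_domains). Each trail
    contributes a *set* of token-domain pairs, so repeats within one
    trail are counted once.
    """
    counts = Counter()
    for tokens, domains in trails:
        for d in set(domains):
            for t in set(tokens):
                counts[(t, d)] += 1
    return counts


def token_distribution_given_domain(counts, tokens_of_interest):
    """Normalize c(t, d) over the chosen tokens to estimate P(t | d)."""
    totals = defaultdict(float)
    for (t, d), c in counts.items():
        if t in tokens_of_interest:
            totals[d] += c
    return {
        (t, d): c / totals[d]
        for (t, d), c in counts.items()
        if t in tokens_of_interest
    }


# Toy trails: tokens may be free tokens (strings) or AV pairs (tuples).
trails = [
    (["portable", ("brand", "haier")], ["www.target.com", "www.target.com"]),
    (["portable"], ["www.target.com", "www.avsforum.com"]),
    (["remote"], ["www.avsforum.com"]),
]
counts = count_token_domain_pairs(trails)
dist = token_distribution_given_domain(counts, {"portable", "remote"})
```

Restricting the normalization to free tokens gives the per-domain distribution of free tokens described above; restricting it to AV pairs of one attribute would give the analogous distribution over that attribute's values.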

We illustrate the output of this computation through an example
from the televisions category. In Table 1, we show the probability
distribution of different brands and model numbers for three
domains. These values tell us that visitors to the domain
www.target.com often start with a query that specifies a manufacturer
of smaller-sized TVs (Westinghouse and Haier produce many portable
TVs), whereas visitors to www.avsforum.com are more mixed (Vizio
and Pioneer produce TVs of all sizes). In addition, users visiting
www.target.com typically do not begin their queries with model
numbers, while those visiting www.avsforum.com do, and
coincidentally with model numbers that correspond to televisions with
large screen sizes.

Likewise, we can investigate the modifiers associated with each
domain. In Table 2, we show the domains and their association with
the modifier portable, as measured by $P(\text{portable} \mid d)$, which
can be interpreted as the fraction of modifier occurrences associated
with domain $d$ that equal portable. We see that the domain
www.target.com has the largest support for this modifier, while the
forum site www.avsforum.com has relatively little support.



At this juncture, one may wonder if we can simply stop and
select the domains that have high support for a free token, and select
all the AV pairs that have high support in those domains to be the
mapping of interest. For example, in light of Tables 1 and 2, for
the free token portable, we would include all the dominant AV pairs
for www.target.com. This approach has multiple issues. First,
choosing dominant domains, and subsequently the AV pairs
associated with those domains, requires the introduction of threshold
parameters that require tuning. Second, and more importantly, even if
these threshold choices were made, the resulting mapping between
the modifier and the AV pairs might not generalize, because it may
be restricted to, for example, certain model numbers or brands. For
instance, in our running example, if only www.target.com were
selected, we would restrict the brands to Westinghouse and Haier,
while brands such as Coby and Dynex also make portable televisions.

To work around this limitation, we instead treat the frequency 
counts computed in this step as input and estimate a conditional 
distribution of the AV pairs given the free tokens. We then use the 
AV pairs with high conditional probabilities to retrieve products 
from the catalog. Finally, we use these products to identify the 
common features of the products and generate a reformulation for 
the modifier. 

Before we proceed to explain these steps, we will describe in
the next section how we employ these frequency counts to identify
the free tokens that correspond to modifiers, i.e., tokens that
influence the set of products to be retrieved.

3.3 Identifying Modifiers 

The goal of this step is to identify the set of modifiers from 
amongst all free tokens. Intuitively speaking, we consider a free 
token to be a modifier if it helps the user distinguish what kinds 
of products she has in mind. As our analysis is aggregated at the 
domain level, we consider a free token to be distinguishing if the 
webpages the users went to are concentrated over few domains. 

Drawing on this intuition, we propose a scoring mechanism based
loosely on the TF-IDF retrieval function. Specifically, for each free
token $f \in F$, let $\mathrm{imp}(f)$ denote its importance score. The
defining formula is incomplete in this copy; the surviving terms are an
operator ranging over $d \in D$ and the ratio

$|D| \,/\, (1.0 + df(f))$,

where $df(f)$ is the number of domains in $D$ for which the free token
$f$ has a non-zero weight. We then select the 10 free tokens with the
highest scores for each category as candidate modifiers.
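
Because the exact form of imp(f) is not fully legible here, the sketch below shows one plausible TF-IDF-style instantiation rather than the authors' formula. It assumes the TF-like term is the sum of P(f | d) over domains and the IDF-like term is the logarithm of |D| / (1.0 + df(f)); both choices, along with the function names, are assumptions.

```python
import math


def importance_scores(p_f_given_d, domains):
    """Assumed form: imp(f) = (sum_d P(f|d)) * log(|D| / (1 + df(f))).

    p_f_given_d: dict mapping (free_token, domain) -> weight.
    df(f) counts the domains where f has a non-zero weight.
    """
    tf = {}
    df = {}
    for (f, _d), w in p_f_given_d.items():
        if w > 0:
            tf[f] = tf.get(f, 0.0) + w
            df[f] = df.get(f, 0) + 1
    n = len(domains)
    return {f: tf[f] * math.log(n / (1.0 + df[f])) for f in tf}


def top_candidates(scores, k=10):
    """Return the k free tokens with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


domains = ["a.com", "b.com", "c.com", "d.com"]
weights = {
    ("portable", "a.com"): 0.8,  # concentrated on a single domain
    ("best", "a.com"): 0.2,      # spread thinly over every domain
    ("best", "b.com"): 0.2,
    ("best", "c.com"): 0.2,
    ("best", "d.com"): 0.2,
}
scores = importance_scores(weights, domains)
```

Under these assumptions a token concentrated on few domains (portable) outranks one spread across all domains (best), which matches the stated intuition that distinguishing tokens lead users to a small number of domains.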

Table 3 illustrates these modifiers ordered by their importance
scores in a number of popular product categories. As the table
shows, our method ranks many terms that restrict the set of
products to be retrieved at the top. For example, we see modifiers
like portable (small and light in weight) and streaming (special
feature) for televisions, and modifiers like evening (restrictions
on color and materials) and small (size restriction) for handbags.
A subset of these category-modifier pairs, together with modifiers
for another six categories, will be used in our experiments in
Section 4. In the next section, we present the details of computing
the association between modifiers and AV pairs.

3.4 Estimating Association Probability 

The goal of this step is to estimate, for each modifier $m$, the
probability $s_i$ with which an AV pair $(a_i, v_i)$ is relevant to
this modifier. In order to compute this probability, we make use of
the following observation. If a modifier is related to an
attribute-value pair, it is very likely that queries that contain
either of them will ultimately lead the users to the same domains.
Therefore, we can leverage the browse trails to compute the
association of an AV pair with a modifier, and normalize across
domains to compute the required probabilities.

Algorithm 1: Postulated generative model of modifiers and AV pairs.

    Pick a domain d according to P(d)
    Pick an attribute a according to P(a|d)
    Pick a value v according to P(v|a,d)
    Pick a modifier m according to P(m|d)

In particular, we postulate a generative process of modifiers and
AV pairs as described in Algorithm 1. A domain $d$ such as
www.sony.com is chosen according to some prior probability over
domains, $P(d)$. Once the domain is chosen, the modifier such as
portable becomes conditionally independent of the AV pairs. An
AV pair $(a, v)$ is generated by first choosing an attribute according
to the domain, and then choosing a value based jointly on that
attribute and the domain. The values are conditioned on the attribute
and the domain since the value distribution is also influenced by the
domain under consideration. For instance, for a brand-centric domain
such as www.sony.com, we would expect $P(v \mid a, d)$ to peak at a
particular value (in this example, sony) for attribute manufacturer,
while for domains such as www.walmart.com, $P(v \mid a, d)$ will be
more uniform. Using this generative model, we can write the joint
distribution as a product of conditional distributions defined by the
generative process:



$$P((a,v), m, d) = P(d)\, P(a \mid d)\, P(v \mid a, d)\, P(m \mid d). \qquad (1)$$



Each of these quantities can be directly obtained from the frequency
counts computed in Section 3.2.

By marginalizing the joint distribution with respect to the
domains, we obtain

$$P((a,v), m) = \sum_{d} P(d)\, P(a \mid d)\, P(v \mid a, d)\, P(m \mid d). \qquad (2)$$

The conditional probability of an AV pair $(a, v)$ given a modifier $m$,
which we call the association score, is given by the chain rule of
probability, and equals:

$$P((a,v) \mid m) = \frac{P((a,v), m)}{\sum_{(a',v') \in AV} P((a',v'), m)} \qquad (3)$$
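
A minimal sketch of the marginalization over domains followed by normalization, assuming the per-domain distributions have already been estimated from the frequency counts. The scores are normalized over AV pairs for each modifier, so that they form a conditional distribution of AV pairs given the modifier. The function name, the input layout (with P(a|d) and P(v|a,d) passed pre-multiplied), and the toy probabilities are all hypothetical.

```python
from collections import defaultdict


def association_scores(p_d, p_av_given_d, p_m_given_d):
    """Compute P((a,v) | m) under the postulated generative model.

    p_d:           {domain: P(d)}
    p_av_given_d:  {domain: {(a, v): P(a|d) * P(v|a,d)}}
    p_m_given_d:   {domain: {m: P(m|d)}}
    Returns {m: {(a, v): association score}}.
    """
    joint = defaultdict(lambda: defaultdict(float))
    for d, pd in p_d.items():
        for av, p_av in p_av_given_d.get(d, {}).items():
            for m, p_m in p_m_given_d.get(d, {}).items():
                # Sum over domains: modifier and AV pair are
                # conditionally independent given the domain.
                joint[m][av] += pd * p_av * p_m
    scores = {}
    for m, row in joint.items():
        total = sum(row.values())  # equals P(m) restricted to seen pairs
        if total > 0:
            scores[m] = {av: p / total for av, p in row.items()}
    return scores


p_d = {"www.target.com": 0.5, "www.avsforum.com": 0.5}
p_av_given_d = {
    "www.target.com": {("brand", "haier"): 0.8, ("brand", "vizio"): 0.2},
    "www.avsforum.com": {("brand", "haier"): 0.1, ("brand", "vizio"): 0.9},
}
p_m_given_d = {
    "www.target.com": {"portable": 0.6},
    "www.avsforum.com": {"portable": 0.1},
}
scores = association_scores(p_d, p_av_given_d, p_m_given_d)
```

In this toy setting the joint mass for (brand, haier) with portable is 0.5·0.8·0.6 + 0.5·0.1·0.1 = 0.245 and for (brand, vizio) it is 0.105, so the association scores are 0.7 and 0.3 respectively.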

Continuing with our example, Table 4 shows a portion of the
results at the end of this step for the modifier portable. Note that the
association score approach has successfully identified brands that
are the major manufacturers of portable televisions. Later in the
paper (Section 4.5), we will show that it can also successfully
identify small diagonal sizes as being highly associated with portable
after clustering diagonal sizes. On the other hand, it has also
identified certain model numbers that correspond to televisions with
large screen sizes (e.g., kdl40s5100) as being relevant for this
modifier.

While this provides an initial candidate set of mappings between 
the modifier and the AV pairs, we would like to obtain those map- 
pings that (a) bring in combinations of AV pairs that are more pre- 
cise to the modifier, and (b) can generalize to the subspace of prod- 
ucts that are relevant to the corresponding modifier. To do so, we 
make use of the product catalog as explained in the next step. 

3.5 Generating Rewrites 

The final step of our algorithm produces sets of AV pairs that can
satisfy the modifiers using the product catalog and the AV pairs
with association scores as input. This is an important step as the
attribute values identified earlier in the process are limited to ones
that appear in user queries; these terms are often skewed towards
certain attributes depending on the category. For example, in the
category of electronic products, a large fraction of queries consist
solely of a model number, and hence a modifier is often associated
with a set of model numbers. We can vastly improve recall by
figuring out how these products manage to satisfy the query.

Category          Candidate modifiers

Refrigerators     commercial, compact, counter-depth, best, freezerless, outdoor, small, side-by-side, undercounter, efficient
Air Conditioners  central, ductless, home, remote, best, small, efficient, commercial, quiet, evaporative
Knives            electric, professional, gourmet, best, chef's, safe, home, ultimate, outdoor, expensive
Ovens             commercial, electric, single, standing, convection, freestanding, professional, outdoor, top, downdraft
Dishwashers       best, countertop, quiet, tall, efficient, compact, built-in, portable, home, professional
Handbags          evening, small, big, newest, popular, oversized, casual, exotic, designer, plus
Televisions       portable, remote, streaming, flat, largest, hd, biggest, compatible, thin, built-in
Radar Detectors   remote, newest, solar, portable, built-in, satellite, powerful, max, luxury, maximum
Voice Recorders   portable, tiny, interactive, remote, powerful, thin, cool, hd, deluxe, built-in
Jackets           kids, women's, girls, retro, hooded, insulated, running, maternity, designer, distressed

Table 3: Top ten valid modifiers for different categories.

Table 4: Conditional distribution of AV pairs given the modifier
portable in the television category

Intuitively speaking, we would like to find a set of AV pairs that 
is both specific, i.e., pinpoints as many attribute values as possible, 
especially for important attributes, and contains as large a fraction 
of the products that match the AV pairs with high association scores 
with the modifier as possible. Further, observing that AV pairs that 
are common in the catalog are more likely to be associated with 
a modifier by random chance, we want to weight the association 
scores of the AV pairs inversely proportional to the number of the 
products with that AV pair in the catalog. This leads us to the con- 
cept of coverage scores. 

Definition 1 (Coverage Scores). Let the set of AV pairs [...]
which measures how well product $p$ satisfies the set $C$. [...]
for which $w(S)$ can be interpreted as the fraction of weights that
the products satisfying $S$ cover among all products, and $z(S)$ the
relative importance of the attributes covered in $S$.

Figure 2: The coverage score of a set of AV pairs that include
attributes $a_1$, [...], $a_4$, satisfied by products $p_1$, $p_2$, [...]

We illustrate the concept of coverage scores in Figure 2. In the
figure, each row corresponds to a product in the database $P$. A row
is taller if it has more weight according to the weight function
induced by the set $C$ of AV pairs and association scores computed in
the last step. Each column corresponds to an attribute. A column is
wider if the attribute is deemed more important by $z$. The coverage
score thus measures the fraction of area covered by a set of AV
pairs in the space of all relevant products and attributes.
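
The area interpretation can be sketched in code. The sketch assumes that a product's weight is the sum, over the scored AV pairs it matches, of the association score divided by the number of catalog products with that pair (the inverse-frequency weighting described earlier), that w(S) is normalized by the total weight of retrieved products, and that z(S) is normalized by the total importance of all attributes. The function names, the normalizers, and the toy data are assumptions.

```python
def product_weights(catalog, scored_pairs):
    """Weight each product by the scored AV pairs it matches.

    scored_pairs: {(a, v): association score}. Each pair's score is
    split evenly among the products that carry it.
    """
    weights = {}
    for av, score in scored_pairs.items():
        matches = [pid for pid, pairs in catalog.items() if av in pairs]
        for pid in matches:
            weights[pid] = weights.get(pid, 0.0) + score / len(matches)
    return weights


def coverage_score(catalog, weights, importance, S):
    """Coverage of S as w(S) * z(S), i.e. covered area in the figure."""
    covered = sum(w for pid, w in weights.items() if S <= catalog[pid])
    w_S = covered / sum(weights.values())       # fraction of product weight
    z_S = sum(importance[a] for a, _ in S) / sum(importance.values())
    return w_S * z_S


catalog = {
    "tv1": {("brand", "haier"), ("size", "small")},
    "tv2": {("brand", "haier"), ("size", "large")},
    "tv3": {("brand", "sony"), ("size", "small")},
}
scored = {("brand", "haier"): 0.6, ("size", "small"): 0.4}
importance = {"brand": 1.0, "size": 1.0}  # hypothetical z values
weights = product_weights(catalog, scored)
both = coverage_score(catalog, weights, importance,
                      {("brand", "haier"), ("size", "small")})
size_only = coverage_score(catalog, weights, importance, {("size", "small")})
```

In this toy catalog the more specific set scores higher (0.5 against 0.35): it covers less product weight but pins down both attributes, which is the trade-off the coverage score is meant to capture.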

Unfortunately, we were unable to determine if there exists an
efficient algorithm to find the set of AV pairs that maximizes the
coverage score. In this study, we develop a heuristic solution that
draws on the ideas of finding frequent itemsets. In our setting,
an item corresponds to a particular AV pair, and an itemset
corresponds to a set of AV pairs. Typically, given a set of baskets $B$
and a desired minimum support threshold $\theta$, an itemset algorithm
finds all maximal itemsets that are contained in at least $\theta$
baskets. It is easy to adapt the Apriori algorithm [2] for finding
itemsets to incorporate weights. Lemma 1 connects the problems
of finding frequent itemsets and of finding the set of AV pairs with
the highest coverage scores.

Algorithm 2: Identify Product Features

input : A set C = {(a_i, v_i, s_i)}_i of AV pairs (a_i, v_i) with
        association scores s_i, product database P, importance
        values of attributes z
output: A set 𝒮 = {(S_j, c_j)}_j of sets of AV pairs S_j with
        coverage scores c_j

P'' ← ∅, W ← 0;
foreach (a_i, v_i, s_i) ∈ C do
    P' ← P({(a_i, v_i)});
    foreach product p ∈ P' do
        Add p to database P'';
        w[p] ← w[p] + s_i / |P'|;
    end
end
𝒮 ← ∅;
for θ ∈ (0, 1) do
    𝒮' ← FIND-ITEMSET(P'', W, θW);
    foreach S ∈ 𝒮' do
        w(S) ← Σ_{p ∈ P(S)} w[p];  z(S) ← Σ_{a : (a,v) ∈ S} z(a);
        c ← w(S) × z(S);
        Add (S, c) to 𝒮;
    end
end
return 𝒮;

LEMMA 1. A set of AV pairs with the highest coverage score
must be a maximal itemset for some threshold $\theta$.

Putting it together, our algorithm for finding the set of AV pairs
with the highest coverage score is given in Algorithm 2. We start by
retrieving the set of products $P'$ for each of the AV pairs identified
in the previous step. For each retrieved product $p$, we insert it into
a database $P''$, and increase its weight in the weight matrix $w$ by
the association score $s_i$ divided by the number of products satisfying
the given AV pair. We then loop through different values of the support
threshold ratio $\theta$, and invoke the adapted itemset algorithm that
works on a weighted database to find all maximal itemsets with
minimum support ratio $\theta$. We compute the coverage scores for each
of the itemsets, and add them to the output. Since the denominators
of both $w(S)$ and $z(S)$ are constant relative to $C$, we only need to
compute the numerators if we are interested in the relative ordering
of the itemsets. Note that the algorithm as stated solves the more
general problem of computing the coverage scores for sets of AV
pairs. We ended up solving this problem as we find it helpful in the
experiments to combine these sets heuristically at retrieval time.
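
The loop over support thresholds can be sketched as below. Note the swap: in place of the adapted Apriori algorithm, the sketch enumerates every candidate itemset by brute force, which is exponential in the number of AV pairs and only suitable for toy inputs. Product weights are given directly, and all names and data are hypothetical.

```python
from itertools import combinations


def weighted_support(catalog, weights, itemset):
    """Total weight of products containing every AV pair in itemset."""
    return sum(w for pid, w in weights.items() if itemset <= catalog[pid])


def maximal_itemsets(catalog, weights, min_support):
    """Brute-force stand-in for the adapted Apriori step (toy data only)."""
    items = sorted({av for pid in weights for av in catalog[pid]})
    frequent = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            if weighted_support(catalog, weights, s) >= min_support:
                frequent.append(s)
    # Keep only itemsets with no frequent proper superset.
    return [s for s in frequent if not any(s < t for t in frequent)]


def identify_product_features(catalog, weights, importance, steps=10):
    """Grid search over thresholds theta = 1/steps, ..., (steps-1)/steps."""
    total = sum(weights.values())
    results = {}
    for i in range(1, steps):
        theta = i / steps
        for s in maximal_itemsets(catalog, weights, theta * total):
            w_s = weighted_support(catalog, weights, s)
            z_s = sum(importance[a] for a, _ in s)
            results[s] = w_s * z_s  # numerators only; enough for ranking
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


catalog = {
    "tv1": {("brand", "haier"), ("size", "small")},
    "tv2": {("brand", "haier"), ("size", "large")},
    "tv3": {("brand", "sony"), ("size", "small")},
}
weights = {"tv1": 0.5, "tv2": 0.3, "tv3": 0.2}  # assumed product weights
importance = {"brand": 1.0, "size": 1.0}        # hypothetical z values
ranked = identify_product_features(catalog, weights, importance)
best_set, best_score = ranked[0]
```

On this toy input, low thresholds surface two-pair itemsets and higher thresholds surface single pairs; the top-ranked set is {(brand, haier), (size, small)}.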

Based on Lemma 1, if we want to find the set of AV pairs with
the highest coverage score, we will need to try all possible values
of $\theta$, which is infeasible. In the implementation of Algorithm 2,
we conduct an $\epsilon$-grid search over the range (0, 1), and hence
cannot guarantee finding the absolute best set. Nonetheless, in our
experiments, we find that we end up with the same set regardless of the
choice of the grid size $\epsilon$, provided $\epsilon$ is smaller than
0.1. An interesting research question is to establish a guarantee on the
quality of the candidate set as a function of $\epsilon$.
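
The procedure described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in the experiments: itemsets are enumerated by brute force rather than by an adapted Apriori algorithm, each product is represented as a set of (attribute, value) pairs, and all function names and the toy data are hypothetical. Only the numerators of w(S) and z(S) are computed, which suffices for ranking.

```python
from itertools import combinations


def weighted_support(itemset, products, weights):
    """Total weight of the products that contain every AV pair in itemset."""
    return sum(w for p, w in zip(products, weights) if itemset <= p)


def maximal_itemsets(products, weights, theta):
    """Brute-force stand-in for a weighted Apriori variant (toy data only)."""
    total = sum(weights)
    items = sorted(set().union(*products))
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if weighted_support(frozenset(c), products, weights) >= theta * total
    ]
    # Keep only itemsets that have no frequent proper superset.
    return [s for s in frequent if not any(s < t for t in frequent)]


def coverage_candidates(products, weights, importance, eps=0.05):
    """Grid-search theta over (0, 1); score each maximal itemset by w(S) * z(S)."""
    scored = {}
    for i in range(1, int(round(1 / eps))):
        theta = i * eps
        for s in maximal_itemsets(products, weights, theta):
            w_s = weighted_support(s, products, weights)
            z_s = sum(importance[a] for a, _ in s)
            scored[s] = w_s * z_s
    return sorted(scored.items(), key=lambda kv: -kv[1])


# Hypothetical toy catalog: three products with accumulated weights.
products = [
    frozenset({("brand", "Danby"), ("type", "compact")}),
    frozenset({("brand", "Danby"), ("type", "compact")}),
    frozenset({("brand", "GE"), ("type", "built-in")}),
]
weights = [0.5, 0.3, 0.2]
importance = {"brand": 1.0, "type": 1.0}  # all attributes equally important
ranked = coverage_candidates(products, weights, importance)
best_set, best_score = ranked[0]
```

On this toy input the top-ranked set pairs the brand Danby with the type compact, since those two values co-occur in the products carrying most of the weight.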
Code  Category          Modifiers

Kitchen Appliances
AC    Air Conditioners  central, commercial, ductless
DW    Dishwashers       portable, quiet
KN    Knives            chef's, gourmet

Clothing and Accessories
HB    Handbags          casual, designer, evening
JK    Jackets           designer, insulated, kids, retro
WL    Wallets           designer, stylish

Consumer Electronics
DP    DVD Players       portable, remote, streaming
RD    Radar Detectors   portable, remote
TV    Televisions       portable, remote, streaming
VR    Voice Recorders   portable, remote

Table 5: Categories and modifiers evaluated in experiments.



4. EXPERIMENTS 

We conducted a series of experiments to evaluate our approach 
for rewriting queries that contain modifiers. The first two exper- 
iments focus on the association scores, and evaluate the absolute 
and relative relevance of the selected attribute values. The third 
experiment evaluates the end-to-end user experience of the search 
results retrieved based on the reformulated queries compared to that 
of using the original queries. 

4.1 Data Preparation 

We obtained the search and browse histories from consenting
users of a popular browser toolbar over a 6-month period between
November 2010 and April 2011. We first classified the queries
into product categories using a Naive Bayes text classifier, and
selected a number of product categories for which there is an
abundance of queries (over 10,000 queries per category). The
queries from the selected categories were then parsed and annotated
with type semantics based on the techniques described in [20]. Next,
we processed these histories as described in Section 3.2. A set of
valid modifiers was selected based on the criteria described in
Section 3.3. The subset of modifiers and categories (from Table 3)
used in the experiments is given in Table 5. We restricted the
experiments to these subsets based on the number of queries
each modifier-category pair received and the uniqueness of the
category under the top-level categories (of consumer electronics,
kitchen appliances, and clothing and accessories).

For each category, we retrieved product details from a product
catalog where each product is specified as a set of attribute-value
pairs as described in Section 3. We then examined the distribution
of attributes and kept only the attributes for which more than 10%
of the products have non-null values. Further, we restricted our
analysis to only the categorical attributes such as brand, model,
and color in a category. This decision was due to the sparsity of
numeric data in our catalog. We will show in Section 4.5 how our
approach can be extended to handle numeric attributes such as
diagonal size and width. The resulting data after pre-processing
constitutes the product database that we use throughout the rest
of the section.

4.2 Evaluation of Identified Attribute Values 
Figure 3: Agreement rates of the selected attribute values by
categories. Categories that belong to the same major category
are further grouped by colors.



In the first experiment, we evaluated the absolute relevance of
the attribute values with high association scores for a modifier. The
experiment is set up as follows. For each modifier of each category,
we selected the five values with the highest association scores
for each attribute. We then asked human judges to rate whether
each of the selected attribute values is relevant to the modifier
in question. The judges are given three options: relevant, not
relevant, and unable to decide. Each attribute-value pair is
evaluated seven times. We post-process the results by filtering out
any judgment that fails a simple sanity test.¹

We measured relevance by agreement rates, defined as

$$\text{agreement rate} = \frac{\#\text{relevant}}{\#\text{relevant} + \#\text{not relevant}}$$

The results by categories are shown in Figure 3.
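
As a small illustration of this measure, the following Python sketch computes the agreement rate for one attribute value from a list of judgment labels. The function name and the sample labels are hypothetical; the only point it makes is that unable to decide judgments fall out of both the numerator and the denominator.

```python
from collections import Counter


def agreement_rate(judgments):
    """#relevant / (#relevant + #not relevant); other labels are ignored."""
    counts = Counter(judgments)
    decided = counts["relevant"] + counts["not relevant"]
    # Return None when no judge made a decision, to avoid dividing by zero.
    return counts["relevant"] / decided if decided else None


# Hypothetical labels for one attribute value (not data from the study).
labels = ["relevant"] * 6 + ["not relevant", "unable to decide"]
rate = agreement_rate(labels)
```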

Across the 12 categories, the average agreement rate of the se- 
lected attribute values is about 95%. The results are consistent 
across the top-level categories, with the average agreement rate for 
kitchen appliances, clothing and accessories, and consumer elec- 
tronics being 94%, 95%, and 97% respectively. 

We present in Table 6 some anecdotal examples of the attribute
values presented to human judges. For portable dishwashers,
the association scores rank the top manufacturer to be Danby
followed by Edgestar. To verify this result, we manually examined a
number of commerce portals and found that 24" Danby dishwashers
are considered a good candidate for portable dishwashers. Likewise,
we examined the manufacturers selected for evening handbags.
We found that Sydney Love handbags are typically more colorful
and smaller than those of other manufacturers, two aspects that are
usually associated with evening handbags. In the category of
refrigerators, we find that people who search for small refrigerators
often look for wine chillers, beverage coolers, etc. Our results bear
out our hypothesis. Finally, going through the product pages of
dishwashers made by Dacor and by Fisher and Paykel, we found
words such as WhisperWash and Quiet that are used to describe
the features of these dishwashers on Amazon.

¹ We deliberately left some entries blank, and caught and threw out
judges who assigned relevant to such entries.

Modifier & Category        Attribute     Attribute values

Ductless air conditioners  Manufacturer  Sanyo, Comfortaire
Portable dishwashers       Manufacturer  Danby, Edgestar
Quiet dishwashers          Manufacturer  Fisher and Paykel, GE, Dacor
Gourmet knives             Manufacturer  Lincoln, Stanley
Small refrigerators        Fridge Type   keg coolers, built-in
Small refrigerators        Manufacturer  Edgestar, Krups, Avanti
Designer handbags          Manufacturer  Chanel, Hermes, Balenciaga
Designer handbags          Bag Material  leather, fur
Evening handbags           Manufacturer  Sydney Love, Buxton
Evening handbags           Bag Material  straw, fabric
Designer hats              Product Name  Bermuda, Cowboy

Table 6: Anecdotal examples of attribute values with high
association scores for select modifiers

4.3 Evaluation of Relative Ordering of Attribute Values

In the second experiment, we evaluated whether attribute values
with higher association scores are considered more relevant
by the users than ones with lower scores. The experiment is set
up as follows. For each modifier of each category, we selected
the five attribute values with the highest and the lowest association
scores. We then created pairs of attribute values, one drawn from
the top and another from the bottom, and asked human judges to
rate which of the two values is more relevant. Note that the bottom
five attribute values are usually still relevant to the modifiers,
as they have non-zero associations with the modifiers. Therefore,
an absolute test of whether an attribute value is relevant is
inappropriate and insufficient, and a relative test is needed.

Figure 4: Fraction of judgements that consider top five
attribute values being more relevant than bottom five attribute
values by categories. Categories that belong to the same major
category are further grouped by colors. Graph is centered at
50%, the expected result when the top five and the bottom five
attribute values are equally relevant.

We measured success by the fraction of the judges that find an
attribute value from the top five more relevant than one from
the bottom five. The results are shown in Figure 4. For ease of
interpretation, we center the graph at 50%, the fraction one expects
if there is no signal in the association scores. Hence, a bar above the
line indicates agreement with the scores, whereas a bar below it
indicates disagreement. Note that not all categories that
appeared previously are present, as fewer than five attribute values
were identified for some categories.

Modifier & Category         Attribute     High Assoc.  Low Assoc.

Small refrigerators         Manufacturer  Edgestar     Princess
Undercounter refrigerators  Manufacturer  U-line       Zanussi
Evening handbags            Manufacturer  Buxton       Hermes
Evening handbags            Manufacturer  Sydney Love  Chanel
Stylish jackets             Manufacturer  Joe Rocket   Arc'teryx
Streaming DVD players       Manufacturer  Samsung      Akai
Portable televisions        Manufacturer  Emerson      LG

Table 7: Anecdotal examples of attribute values with high and
low association scores for select modifiers

Diagonal Size (cm)  Association score

0 to 40             0.68
40 to 80            0.65
80 to 120           0.11

Table 9: Mapping of portable televisions to diagonal sizes

Across the eight categories, the average fraction of judges in
favor of the top five attributes is 58%. Given that this fraction is
computed with more than 2400 observations, the result is statistically
significant under a one-proportion z-test (p-value < 0.0001) against
the null hypothesis that the top and bottom five attribute values are
equally good. However, while we obtained good results for 6 out
of 8 categories, we did poorly for handbags and voice recorders.
We examined the results in detail and found that one possible
explanation is that the manufacturers with high association scores for
evening handbags and portable voice recorders are typically
lesser known brands that specialize in that specific segment
of products, while the ones with low association scores are
typically better known brands that produce all lines of products, and
the judges consistently favor the better known brands. We discuss
this potential experimental bias further in Section 4.6.
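
The significance claim for the 58% figure can be checked with a short calculation. The Python sketch below is illustrative only and rests on assumptions not stated in the text: it takes exactly 2400 observations (the text says "more than 2400"), treats the test as one-sided, and uses the normal approximation; the function name is hypothetical.

```python
import math


def one_proportion_z_test(successes, n, p0=0.5):
    """Return (z, one-sided p-value) for H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Assumed counts: 58% of 2400 judgments favor the top five values.
z, p_value = one_proportion_z_test(successes=1392, n=2400)
```

Under these assumptions z is roughly 7.8, which is consistent with the reported p-value below 0.0001.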

We present some of the anecdotal results that point to the
difference between the attribute values with high and low association
scores in Table 7. In the top-level category of electronics, for DVD
players, examining the product listings on Amazon suggests that
Samsung is more popular in selling streaming (i.e., wi-fi enabled)
DVD players than Akai; likewise, for televisions, Emerson is better
known for TVs with smaller diagonal size (20 inches and below)
than LG Electronics, which produces televisions of all sizes,
especially large LCD TVs.

4.4 End-to-end Evaluation of Query Rewrites 

In the final experiment, we evaluated whether our query
reformulation technique leads to improved relevance in the results
retrieved. The experiment is set up as follows. For each modifier
of each category, we consider the set of AV pairs with the highest
coverage score as produced by Algorithm 2.² We generate a
query reformulation by concatenating together the attribute values.
In many cases, we found that the coverage score is low due to a
large number of missing values (nulls) in the database. To address
this database quality issue, we heuristically combine the sets of AV
pairs with highest coverage scores for a modifier-category pair if
the attributes are disjoint. This heuristic is based on the observation
that a missing value may take on any valid value from the domain,
and hence it is sensible to combine multiple sets of AV pairs that
are disjoint on the set of attributes selected. The problem of data
sparsity will be addressed further in Section 4.6. We issued both the
reformulation and the original query to Amazon, and asked human
judges to rate which of the results are more relevant to the original
query. The results are presented in Table 8.

² As part of the input to the computation of coverage scores, we
need importance values for the attributes. For this experiment, we
treat all attributes as equally important.

In 15 out of 18 queries (87%), the judges prefer the reformulation
over the original query. Examining the result pages for both
queries, we find that our results are better partly because of the
products they retrieved, and partly due to the inclusion of solely
products that belong to the category in question. For example,
the query [ductless air conditioners] retrieved a mix of air
conditioners, books, and remote controls, whereas the reformulation
[sanyo mini split air conditioners] retrieved only air
conditioners. This example highlights the danger of treating each
query word as a keyword, as keywords often cause unintended
matches.

Power Output (BTU)  Association Score

> 15000             0.22
12000 to 15000      0.16
8000 to 12000       0.15

Table 10: Mapping of central air conditioners to power output
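
The heuristic used in this section for combining sets of AV pairs, followed by concatenation of the attribute values into a query, can be sketched in Python as follows. This is an illustrative sketch under assumptions: the text does not specify the order in which sets are merged, so the sketch greedily walks the sets in decreasing order of coverage score; the function names, scores, and sample sets are hypothetical.

```python
def combine_disjoint(ranked_sets):
    """Merge AV sets, best first, skipping any set that reuses an attribute."""
    combined = {}
    for av_set, _score in ranked_sets:
        attributes = {attribute for attribute, _ in av_set}
        if attributes.isdisjoint(combined):  # compares against chosen attributes
            combined.update(av_set)
    return combined


def reformulate(ranked_sets, category):
    """Concatenate the selected attribute values and the category name."""
    values = combine_disjoint(ranked_sets).values()
    return " ".join([*values, category])


# Hypothetical sets of AV pairs, sorted by decreasing coverage score.
ranked_sets = [
    ([("manufacturer", "Sanyo")], 0.4),
    ([("type", "mini split")], 0.3),
    ([("manufacturer", "Haier")], 0.2),  # skipped: manufacturer already chosen
]
query = reformulate(ranked_sets, "air conditioners")
```

With these inputs the sketch produces the query "Sanyo mini split air conditioners"; the third set is dropped because its attribute overlaps with one already selected.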

4.5 Handling Numeric Attributes 

Thus far, we have only considered categorical attributes in our
approach. While in principle none of the four steps of our
algorithm depends on attributes being categorical, in practice numeric
attributes such as diagonal size and price pose additional
challenges as they can take on many values. A direct application of
the algorithm that treats each distinct numeric value as unique will
often result in extremely low association scores, and subsequently
exclusion of these attributes from the reformulation due to low
coverage scores.

To work around this problem, one should start by grouping the 
numeric values into a small number of buckets and treat all val- 
ues within a bucket as equivalent. The grouping can be done using 
standard database histogram techniques such as by equal width or 
equal depth. The resulting buckets can then be treated as categori- 
cal attributes and our algorithm can proceed as before. 
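
The two bucketing schemes mentioned above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and default parameters; it assumes non-negative values and an open-ended last bucket for the equal-width case.

```python
import bisect


def equal_width_label(value, width=40.0, n_buckets=5):
    """Map a non-negative value to a fixed-width bucket label."""
    index = min(int(value // width), n_buckets - 1)
    low = int(index * width)
    if index == n_buckets - 1:
        return f"{low}+"  # last bucket is open-ended
    return f"{low} to {low + int(width)}"


def equal_depth_edges(values, n_buckets):
    """Cut points that place roughly the same number of values in each bucket."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]


def bucket_index(value, edges):
    """Index of the bucket that value falls into, given sorted cut points."""
    return bisect.bisect_right(edges, value)


# Hypothetical usage: label a diagonal size and bucket a value by depth.
label = equal_width_label(101.6)
edges = equal_depth_edges([1, 2, 3, 4, 5, 6, 7, 8], 4)
index = bucket_index(4, edges)
```

The resulting labels or indices can then stand in for the raw numbers, so that each bucket plays the role of one categorical value.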

Due to data sparsity in our catalog, we did not manage to
successfully apply this technique across all categories of products. In
some categories, for example refrigerators, the availability of
numeric data is strongly skewed towards the large refrigerators,
rendering the technique inapplicable for modifiers that involve sizes.
Nonetheless, we found some success for some other categories.

For the category of TVs, we grouped the diagonal size values
into 5 equal-width buckets: 0 to 40cm, 40 to 80cm, 80 to
120cm, 120 to 160cm, and 160cm and above; this width is selected
to spread the TVs out well (approximately equal depth). We then
proceeded with the algorithm and computed the association scores of
diagonal sizes with the modifier portable. The results are
provided in Table 9. As one can see, after grouping the diagonal sizes
into 5 buckets, our approach has selected the smaller TV diagonal
sizes as being associated with portable. We tried varying the
bucketing (for example to equal-width buckets of 20cm instead of 40cm)
and obtained similar results.

As another category, we consider air conditioners. The
numeric attribute of interest in this category is power output.
Consider central air conditioners. Table 10 shows the mapping
of the modifier to the attribute values. The results confirmed
our intuition that users looking for central air conditioners tend
to look for ones with higher power output than, say, room
air conditioners such as mini-split air conditioners.



Category  Original query               Rewritten query                               Which query is better?

AC        Ductless air conditioners    Sanyo mini split air conditioners             rewrite
AC        Commercial air conditioners  Haier portable air conditioners               rewrite
AC        Central air conditioners     Haier portable air conditioners               rewrite
DW        Quiet dishwashers            Maytag stainless steel dishwashers            original
DW        Portable dishwashers         General electric stainless steel dishwashers  rewrite
KN        Chef's knives                [illegible] knife sets                        rewrite
KN        Gourmet knives               Wusthof knife sets                            rewrite
OV        Freestanding ovens           General electric stainless steel ovens        original
RF        Small refrigerators          stainless steel Danby refrigerators           rewrite
RF        Counter-depth refrigerators  stainless steel Samsung refrigerators         rewrite
HB        [illegible]                  [illegible]                                   rewrite
HB        Evening handbags             Sydney Love fabric handbags                   original
JK        Insulated jackets            waterproof jackets                            rewrite
JK        Designer jackets             leather jackets                               rewrite
JK        Kids jackets                 waterproof jackets                            rewrite
RD        Remote radar detectors       k-band city vg2 immunity radar detectors      rewrite
TV        Portable televisions         Samsung tft active matrix led hdtv            rewrite
VR        Portable voice recorders     Sony icd digital voice recorder               rewrite

Table 8: End-to-end evaluation of query rewrites.



4.6 Discussion of Results 

Our approach to inferring and associating AV pairs to modifiers 
is based on wisdom of crowds through the use of browse trails. 
Using the browse trails of a large number of users, we associated 
modifiers to the attribute value pairs, and used these associations 
along with their probabilities to infer sets of AV pairs that best de- 
scribe the semantics. 

Our first experiment showed that the top scoring attribute-value 
pair associations are highly relevant to the modifier. This validates 
our assumption that the domains that the users reach are likely to 
be similar for the queries that contain a modifier and its associated 
attribute-value pairs. Therefore, by tracing the trails of domains, 
one can find reliable associations. Further, our technique, for each 
modifier, is also accurate in determining the ordering of these as- 
sociations, as shown by our second experiment. The third experi- 
ment shows the importance of generalizing the associations to en- 
able better recall, both by adding additional attribute value pairs 
not present in the queries, and by identifying more holistic repre- 
sentation of the associations. In 87% of the cases, the re-written 
query using our approach resulted in retrieval of products consid- 
ered more relevant by the users. 

There are certain limitations to the experiments presented. First, 
Mechanical Turk experiments are often noisy, and human judges 
could be subject to different sources of biases. As noted before, we 
found that the human judges in the experiments have exhibited a 
bias towards better known brands. This bias does not necessarily 
work in our favor as our approach is not designed to take advantage 
of popular brands. Indeed, for a number of cases our algorithm 
ends up selecting lesser-known specialty brands over well-known 
brands. Second, human judges may not always be knowledgeable
about the particular products. As discussed in the Introduction, 
users issue queries containing modifiers partly due to lack of do- 
main knowledge and do so to seek help from the search engine. 
They may be unfamiliar with the manufacturers that specialize in 
making streaming DVD players or in making quiet dishwashers. 
However, note that we are not asking the human judges to come up
with the mapping, but rather to validate the algorithmically generated
mapping, a relatively simpler task. Further, we gave the option
of unable to decide to the human judges, and our results
are aggregated over many judgments. As a further safeguard, we 
complement the Mechanical Turk experiment with a careful man- 
ual examination of a sample of the selected attribute values, and 
confirm that most of these attribute values are highly related to the 
given modifiers. 

Finally, we would like to observe that our solution takes input
from several components and its ultimate success depends on the
precision of these components. Problems in these components can
manifest themselves in a variety of ways. For example, queries can
sometimes be misclassified, and annotations may confuse one attribute
with another. Such errors often lead to poor estimates of associa- 
tion scores. The product catalog we work with has many missing 
values. This limits the success of our final reformulation step as 
all the coverage scores were depressed. Improving the quality of 
the catalog will certainly improve our results, and this was con- 
firmed in a smaller-scale experiment where we manually scrubbed 
the air conditioner database and obtained better results. Nonethe- 
less, despite these limitations, our empirical experiment suggests 
that our approach can generate good query reformulations, and val- 
idates our belief that there are signals in the browse trails that can 
be harnessed to address the challenging problem of serving queries 
with modifiers. 



5. CONCLUSIONS 

We study the problem of query reformulation in commerce search,
which maps queries containing modifiers to ones that specify precise
attribute values of the products to be retrieved. We did so by
combining user behavior data and the product catalog of a com- 
merce search engine to produce the mappings. The user behavior 
data provides us with an initial association of attribute values with 
modifiers, the signal of which was then amplified and generalized 
through the use of the product catalog to identify common features 
of the products that satisfy the selected attribute values. As part of 
a comprehensive user study, we find that users agree with the
attribute values selected by our approach in about 95% of the cases
and they prefer the results surfaced for our reformulated queries to
ones for the original queries 87% of the time.

There are several future research directions suggested by this 
study. First, our work has focused on approaching the challenge 
of answering queries containing modifiers through query reformu- 
lation. This was done in the context of treating the existing backend 
serving infrastructure of commerce search engines as given. An in- 
teresting challenge is to develop an end-to-end solution that directly 
retrieves products for the given query. We believe it is possible to 
develop a good solution by intelligently aggregating the top few 
sets of AV pairs with highest coverage scores. Second, our solu- 
tion does not explicitly take into account the noise introduced by 
the components it relies on nor the data sparsity of the catalog. As 
explained in Section 4.6, these have negative impacts on the quality
of the rewrites. While improving the data quality or the component 
quality is outside of the scope of this study, we believe it may be 
possible to produce better rewrites if these considerations are in- 
corporated as part of a probabilistic framework. Finally, there are 
other search domains where structured data exist and where it is 
common for users to issue queries containing terms that cannot be 
directly associated with the structured data, for example, in travel
and in health. It will be an important challenge to extend the tech- 
niques presented in this work to these other structured domains. 